Lecture 2: Stochastic Multi-armed Bandit (IID Model)
1.1 Reward model

The next step in defining the model is to describe how the rewards are generated. This is where the stochastic assumption and the IID model come in. In particular, we assume that the reward from each arm $i$ follows a distribution $\nu_i$ with mean $\mu_i$: whenever arm $i$ is pulled, the reward is generated independently from $\nu_i$. More precisely, let $H_{t-1}$ denote the history up to and including time $t-1$, so that $H_{t-1} = \{(I_1, r_1), \ldots, (I_{t-1}, r_{t-1})\}$. Then our assumption on the reward can be written as
$$r_t \mid (H_{t-1}, I_t = i) \sim \nu_i,$$
which also implies $\mathbb{E}[r_t \mid H_{t-1}, I_t = i] = \mu_i$. In other words, given the history up to time $t-1$ and the choice of arm $I_t$, the reward is drawn from the distribution of the chosen arm, independently of the history.
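To make the IID reward model concrete, here is a minimal simulation sketch in Python (not part of the lecture). The class name StochasticBandit, the choice of Bernoulli arm distributions, and the specific means are illustrative assumptions; the model itself allows any distributions $\nu_i$ with means $\mu_i$.

    import numpy as np

    class StochasticBandit:
        """IID stochastic bandit: arm i yields rewards drawn i.i.d. from nu_i.

        For illustration each nu_i is Bernoulli(mu_i); any distribution with
        mean mu_i would fit the model described above.
        """

        def __init__(self, means, seed=0):
            self.means = np.asarray(means)          # mu_1, ..., mu_K
            self.rng = np.random.default_rng(seed)

        def pull(self, i):
            # The reward depends only on the chosen arm's distribution nu_i,
            # not on the history H_{t-1}: E[r_t | H_{t-1}, I_t = i] = mu_i.
            return float(self.rng.random() < self.means[i])

    # Because pulls are independent draws, the empirical mean of repeated
    # pulls of arm i converges to mu_i.
    bandit = StochasticBandit(means=[0.3, 0.5, 0.7])
    rewards = [bandit.pull(2) for _ in range(10_000)]
    print(sum(rewards) / len(rewards))  # close to 0.7

The key design point is that pull() never consults the history: the IID assumption means the conditional reward distribution, given the past and the chosen arm, is always the same $\nu_i$.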